Allow more than 2^32 sequences to be clustered #1039
Hello @martin-steinegger, this is a very impressive and crucial enhancement for large-scale clustering. We are currently facing a project that requires clustering ~11 billion (1.1e10) protein sequences.
Could you please advise whether there is a version of MMseqs2 (such as a branch from this PR) that can already handle a dataset of this scale? If a single run is not yet feasible, what would be the recommended strategy? For example, is the "split-cluster-merge" approach the best practice? Have you run any scalability tests or benchmarks for clustering at this scale (e.g., tens of billions of sequences)?
Any guidance or insights from you would be immensely helpful for our work. Thank you for developing and continuously improving this fantastic tool!
This PR is very much in development and not production-ready. Our current recommendation is still to split the database into chunks of 2-3 billion sequences and cluster each chunk separately. Afterwards, merge the clustered chunks until you reach 2-3 billion sequences again, and keep clustering until everything is done. We are of course interested in getting native support for this into MMseqs2, but that might still take a while.
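The round structure of this recommendation can be sketched as a small planning calculation. This is only an illustration, not part of MMseqs2: the function name `plan_rounds`, the 2.5 billion chunk size, and especially `keep_fraction` (the share of sequences assumed to survive each clustering round as representatives) are hypothetical. The real reduction depends entirely on the data and the clustering parameters.

```python
import math


def plan_rounds(total, chunk_size, keep_fraction):
    """Plan split-cluster-merge rounds.

    total         -- number of input sequences
    chunk_size    -- maximum sequences per chunk (2-3 billion per the advice above)
    keep_fraction -- ASSUMED share of sequences left as cluster representatives
                     after one round (hypothetical; measure it on real data)

    Returns a list of (round_number, sequences_in, chunks) tuples.
    """
    if total <= 0 or chunk_size <= 0:
        raise ValueError("total and chunk_size must be positive")
    if not 0 < keep_fraction < 1:
        raise ValueError("keep_fraction must be strictly between 0 and 1")

    rounds = []
    remaining = total
    round_number = 1
    # While the data does not fit in one chunk: split, cluster each chunk,
    # then merge the surviving representatives for the next round.
    while remaining > chunk_size:
        chunks = math.ceil(remaining / chunk_size)
        rounds.append((round_number, remaining, chunks))
        remaining = math.ceil(remaining * keep_fraction)
        round_number += 1
    # Final round: everything fits in a single chunk and is clustered once more.
    rounds.append((round_number, remaining, 1))
    return rounds


if __name__ == "__main__":
    # ~11 billion sequences, 2.5 billion per chunk, assuming each round halves the set.
    for round_number, sequences_in, chunks in plan_rounds(
        11_000_000_000, 2_500_000_000, 0.5
    ):
        print(f"round {round_number}: {sequences_in:,} sequences in {chunks} chunk(s)")
```

Under these assumed numbers the plan needs four rounds with 5, 3, 2, and 1 chunks. A weaker reduction per round means more rounds, so it is worth measuring the actual reduction on the first round before planning the rest.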
We have clustered ~100B sequences with the split-and-merge strategy before.
@milot-mirdita |
@milot-mirdita, can you share a step-by-step example (including commands) of this split-cluster-merge workflow in MMseqs2?
No description provided.